Abstract

This  paper  describes  a  scheme  to  answer  best-match  queries  from  a  file  containing  a  collection  of  objects. 
A  best-match  query  is  to  find  the  objects  in  the  file  which  are  closest  (according  to  some  similarity  measure) 
to  a  given  target. 

Previous  work  [4,  22]  suggested  that  one  can  reduce  the  number  of  comparisons  required  to  achieve  the 
desired  results  using  the  triangle  inequality,  starting  with  a  well-formed  data  structure  for  the  file  which 
reflects  some  precomputed  intra-file  distances.  We  generalize  this  technique  to  allow  data  structures  with 
any  kind  of  topology,  that  is,  we  allow  an  arbitrary  set  of  object  pairs  in  the  file  with  absent  distances.  A 
Floyd-Warshall  style  algorithm  is  used  to  achieve  the  optimal  approximation  of  those  unknown  distances. 
A  heuristic  for  choosing  objects  out  of  the  file  to  compare  against  the  target  is  presented.  Artificial  data 
and actual protein sequences are used to illustrate our scheme, to demonstrate some weaknesses of the
technique, and to show the better performance of our scheme over others when given the same data structure.
bounds, topology
1      Introduction

The  best-match  problem  (also  known  as  the  "post  office  problem  [16],"  or  the  "nearest  neighbor  problem  [11]") 
arises  in  many  applications,  such  as  database  management  [13],  information  retrieval  [4,  7,  22],  molecular 
biology  [15,  17],  pattern  classification  [11,  14,  18]  and  computational  geometry  [16].  The  problem  is  to  find, 
among  a  file  of  objects,  the  ones  which  are  most  similar  or  closest  to  a  given  target  according  to  some  similarity 
distance measure. To answer this type of query, one could compute the distance between each object of the
file  and  the  target,  and  then  search  for  the  objects  with  minimum  distance.  The  major  problem  with  this 
approach  is  its  computational  expense,  particularly  when  there  are  many  targets  to  be  identified  and  the  file 
is  large.  To  reduce  computational  effort,  many  techniques  have  been  presented  in  the  literature. 

Du  and  Lee  [6],  for  example,  used  Gray  code  as  a  multikey  hashing  function  for  finding  closest  symbolic 
records.  The  similarity  measure  they  used  is  the  Hamming  distance.  Feustel  and  Shapiro  [8]  described  a 
method for determining the nearest neighbor to an unknown graph in an abstract metric space. Shamos
[21] applied the Voronoi diagram (a general structure for searching a plane) to the best-match problem for
records with two keys (two dimensions) and a Euclidean distance measure. Fukunaga and Narendra [11] used
a branch and bound algorithm to find nearest neighbors to a test pattern in a multidimensional space. The
distance measure used is also the Euclidean metric. The same method has been improved in [14, 18]. These papers
are mainly concerned with specific distance metrics, usually Euclidean.

Some papers use special file structures. Friedman et al. [10], for instance, partitioned a d-dimensional
feature space using the k-d tree method and searched by descending the tree. Eastman and Weiss [7]
proposed a modified k-d tree for high dimensional spaces. Ito and Kizawa [13] devised HL files (Hierarchically
organized files based on a Linear ordering) for spelling correction applications. They gave a recursive procedure
to search the file for retrieving similar strings.

There are strategies that make a weaker assumption on distance measures, namely, that they satisfy only
the fundamental properties of a metric. Burkhard and Keller [4], for example, proposed cut-off procedures
to  reduce  computational  effort  simply  using  the  triangle  inequality.  Sufficient  conditions  for  objects  to  be 
eliminated  from  consideration  while  searching  a  file  were  given.  The  objects  they  considered  are  rather 
general  and  could  be  any  ones  on  which  distance  functions  can  be  defined.  Shapiro  [22]  later  improved  their 
method  by  precomputing  more  distances  between  objects  in  the  file  and  by  deriving  stricter  cut-off  criteria 
for  eliminating  objects.  This  paper  improves  on  these  techniques. 

Briefly  summarized,  all  approaches  share  the  following  characteristics: 

• Distance computation is costly. As a result, they aim to skip such computation as much as possible
during searching (an objective similar to [12, 19-20, 23], where efforts were devoted to saving the
overlapping computation of spatial objects).

•  To  achieve  this  goal,  they  perform  preprocessing  on  the  data.  This  includes  precomputing  some  distances 
between  objects  in  a  file  and/or  organizing  the  file  into  some  structure. 

•  An  algorithm  (or  heuristic)  is  proposed  to  choose  objects  out  of  the  file  to  compare  against  the  target. 

1.1      Our  Approach 

We assume a true distance metric, that is, a function $d$ that takes pairs of objects into nonnegative numbers,
satisfying the following three properties: for any objects $O_1$, $O_2$, $O_3$, $d(O_1, O_2) \ge 0$, and
$d(O_1, O_2) = 0$ iff $O_1 = O_2$ (non-negative definiteness); $d(O_1, O_2) = d(O_2, O_1)$ (symmetry);
$d(O_1, O_2) \le d(O_1, O_3) + d(O_3, O_2)$ (triangle inequality). We assume no further structure
(e.g. k-dimensionality) on the metric.

We include a preprocessing possibility: all inter-object distances within a database have been precomputed
and stored in a distance map. This gives us the "best possible" prior information. Using the triangle inequality,
we develop cut-off procedures to reduce the number of comparisons while searching the database.

We then generalize this technique to allow an arbitrary number of object pairs in the database with unknown
distances. The motivation for considering this problem is that at times we may be given a set of distances
which have been calculated, but we were not able to choose which ones. We use a Floyd-Warshall [9, 25] style
algorithm to approximate those unknown distances. When searching for closest objects, we apply the triangle
inequality whenever possible to the known distance values (including the ones found during the search) to
derive successively tighter bounds on the distances between unexamined object pairs. These bounds often
significantly cut down the number of objects one should examine later. We show that our scheme is optimal in the
sense that it makes the best use possible (in a sense to be made clear) of the distance information that is available.
We also show through experiments that our scheme has better performance than Burkhard's methods when
given the same distance information.

2     Distance  Map 

Let $\mathcal{D}$ be a database composed of $n$ objects. For convenience, we assume these objects are numbered from 1
to $n$ and refer to the object numbered $i$ simply as $O_i$. A distance map for $\mathcal{D}$ is an $n \times n$ matrix $DM$, where
$DM[i,j]$ is the distance between $O_i$ and $O_j$. To start, suppose that $DM$ is precomputed in its entirety.

The best-match problem is defined as follows: Given a target $T$, find the closest pairs $(T, O)$ over all $O \in \mathcal{D}$, i.e.
$\{O \in \mathcal{D} \mid d(O, T) \le d(O', T),\ \forall O' \in \mathcal{D}\}$. With the use of the triangle inequality, one can often reduce the
number of comparisons required to achieve the results for this query. For example, consider $T$ and a database
of three objects shown in Figure 1. Suppose $d(T, O_1) = 3$ has been computed. We get
$d(T, O_2) \ge d(O_1, O_2) - d(T, O_1) = 5$, and $d(T, O_3) \ge d(O_1, O_3) - d(T, O_1) = 6$. Thus we can conclude that $O_1$ is
closest to $T$ without computing $d(T, O_i)$, $i = 2, 3$. Notice also that computing $d(T, O_2)$ (or $d(T, O_3)$) first would
require additional distance computations. This then raises the following optimization problem: What is the smallest set
$S = \{O_i \mid d(T, O_i) \text{ is actually computed}\} \subseteq \mathcal{D}$ such that every object outside $S$ is farther from $T$ than some object in $S$?


Figure 1. The integer on edge (O, O') represents dist(O, O'). The dashed edge
means the distance has not been computed yet. Integers inside the parentheses
are lower bounds for those uncomputed distance values.
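
The elimination step in this example can be checked numerically. The following is a minimal sketch, not part of the original paper: the helper name `lower_bound` is hypothetical, and the values d(O1, O2) = 8 and d(O1, O3) = 9 are inferred from the bounds 5 and 6 quoted in the text rather than read from the figure.

```python
def lower_bound(d_target_pivot, d_pivot_obj):
    """Lower bound on d(T, O) obtained from d(T, X) and d(X, O)
    by the triangle inequality: d(T, O) >= |d(X, O) - d(T, X)|."""
    return abs(d_pivot_obj - d_target_pivot)

# d(T, O1) = 3 is given in the text; 8 and 9 are inferred (assumption).
d_T_O1 = 3
bounds = {
    "O2": lower_bound(d_T_O1, 8),
    "O3": lower_bound(d_T_O1, 9),
}

# O1 is the closest object if every other lower bound exceeds d(T, O1).
o1_is_closest = all(b > d_T_O1 for b in bounds.values())
```

Both bounds exceed 3, so neither d(T, O2) nor d(T, O3) has to be computed.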


Lemma 1. Every object outside $S$ is farther from $T$ than some object in $S$ is equivalent to: $\forall O_j \in \mathcal{D} - S$, $\exists O_k \in S$,
such that $|d(O_j, O_k) - d(T, O_k)| > \ell$, where $\ell = \min_{O_i \in S}\{d(T, O_i)\}$.

Proof. If $d(O_j, O_k) > d(T, O_k) + \ell$ then $d(O_j, O_k) - d(T, O_k) > \ell$ by rearrangement, and the left-hand side
is a lower bound on $d(T, O_j)$ by the triangle inequality.

If $d(O_j, O_k) < d(T, O_k) - \ell$ then $\ell < d(T, O_k) - d(O_j, O_k)$ and the right-hand side is a lower bound on
$d(T, O_j)$.        □

We  proceed  in  stages  to  seek  a  set  S  with  smallest  size  possible.  At  each  stage,  an  object  is  chosen  and 
its  distance  from  T  is  computed.  Based  on  this  distance  value,  we  can  derive  a  useful  lower  bound  on  the 
distances  between  the  target  and  some  objects  that  haven't  been  computed. 

Lemma 2. Let $X_i$ be the object chosen at stage $i$, and $d_i = d(T, X_i)$. Let $m$ be the minimum distance
value obtained from previous stages, i.e. $m = \min\{d_j \mid 1 \le j \le i - 1\}$. Then for all objects
$O \in \mathcal{D} - \{X_j \mid 1 \le j \le i\}$:

(1) When $d_i \ge m$, $d(T, O) > m$ if $d(O, X_i) > d_i + m$ or $d(O, X_i) < d_i - m$.

(2) When $d_i < m$, $d(T, O) > d_i$ if $d(O, X_i) > 2d_i$, or
$d(O, X_j) > d_j + d_i$ or $d(O, X_j) < d_j - d_i$, for any $j$ with $1 \le j \le i - 1$.

Proof. By repeated application of Lemma 1.       □

The sufficient conditions given in the above lemma provide cut-off criteria, based on which one can eliminate
some objects (i.e. those whose lower bounds are greater than $m$) from consideration at each stage (see
Figure 2). Note that in case (1) above, we do not consider objects that are farther from $X_j$ than $d_j + d_i$
or closer to it than $d_j - d_i$, because they have all been eliminated in previous stages.

Each stage filters out irrelevant objects and retains only those objects needed in future stages. Figure 3
summarizes the algorithm.¹

I := D; m := ∞;
repeat
    pick an object O' in I;
    calculate d = d(T, O');
    if d ≥ m then
        S := {O | (d(T, O) is not computed)
                  ∧ (d - m ≤ d(O, O') ≤ d + m)}
    else
        S := {O | (d(T, O) is not computed)
                  ∧ (d(O, O') ≤ 2d)
                  ∧ (∀ X_j where d(T, X_j) = d_j is computed, d_j - d ≤ d(O, X_j) ≤ d_j + d)};
    end if
    m := min {m, d};
    I := I ∩ S
until I = ∅;

Figure  3.  Algorithm  SEARCH 
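
The steps of Figure 3 can be sketched in runnable form as follows. This is only an illustrative sketch under stated assumptions: the function name `search`, the callable `dist_to_target`, and the default selection rule (lowest index first) are not from the paper, and the example data (points on a line with the target at 5) are made up for the demonstration.

```python
import math

def search(dm, dist_to_target, pick=min):
    """Sketch of algorithm SEARCH on a complete distance map.

    dm[i][j]       -- precomputed distance between objects i and j
    dist_to_target -- callable i -> d(T, O_i); one call = one comparison
    pick           -- rule choosing the next candidate (assumption: lowest index)
    Returns (minimum distance found, number of comparisons made).
    """
    candidates = set(range(len(dm)))   # the set I
    computed = {}                      # X_j -> d_j for objects already compared
    m = math.inf
    while candidates:
        x = pick(candidates)
        d = dist_to_target(x)
        rest = candidates - {x}
        if d >= m:
            # Case (1): keep objects whose distance to x lies in [d - m, d + m].
            keep = {o for o in rest if d - m <= dm[o][x] <= d + m}
        else:
            # Case (2): d becomes the new minimum, so re-check earlier pivots too.
            keep = {o for o in rest
                    if dm[o][x] <= 2 * d
                    and all(dj - d <= dm[o][xj] <= dj + d
                            for xj, dj in computed.items())}
        computed[x] = d
        m = min(m, d)
        candidates = keep              # I := I ∩ S
    return m, len(computed)

# Hypothetical data: four objects on a line, target located at 5.
points = [0, 4, 10, 11]
dm = [[abs(p - q) for q in points] for p in points]
result = search(dm, lambda i: abs(points[i] - 5))
```

On this toy input the minimum distance 1 is found after two comparisons instead of four.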


¹ To keep the presentation concise, we only show how to find the minimum distance between $T$ and objects in $\mathcal{D}$. Finding the
closest objects, however, is an easy variant of this approach.


Case 1: When $d_i \ge m$, throw out objects in the ranges $(0, d_i - m)$ and $(d_i + m, \infty)$,
as they are farther from $T$ than $m$.

Case 2: When $d_i < m$, (step 1) throw out objects in the range $(2d_i, \infty)$.

Case 2: (step 2) Since $d_i$ becomes the new minimum distance value, also discard
objects in the ranges $(0, d_j - d_i)$ and $(d_j + d_i, \infty)$ (if they haven't been discarded
yet).

Figure 2. $X_i$ ($X_j$) is the origin. Points on the positive x axis represent objects
in $\mathcal{D}$. The x coordinate of object $O$ represents the distance between $O$ and the
origin. Objects not on bold lines will be discarded.


2.1      A  Search  Heuristic 

To avoid a "blind" search, we have developed and analyzed several heuristics that help pick a candidate object
(i.e. an object that is still in $I$) at each stage [24]. We present here only the best heuristic, which we call pick
least lower bound.

The heuristic picks its first object randomly. In subsequent stages, it selects a candidate $O'$ such that the
lower bound of the distance between $O'$ and the given target $T$ is minimized based on all previous candidates.
(These lower bounds are obtained by $d(T, O') \ge |d(T, O'') - d(O'', O')|$ where $O''$ represents objects picked
in previous stages.) Intuitively this heuristic uses the lower bound to estimate the exact distance. Thus the
object having the least lower bound is expected to be (potentially) the closest object to $T$. If several candidates
have the same lower bound, the heuristic selects one that has the least upper bound. (The upper bound is
obtained by $d(T, O') \le d(T, O'') + d(O'', O')$.) The reason for doing so is that we expect the smaller the
difference between the lower and upper bounds, the more precise the estimated distance is. If there is still a
tie on the difference, the heuristic arbitrarily picks one.
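
The selection rule can be sketched as follows. This is an illustrative sketch assuming a complete distance map; the function name `pick_least_lower_bound` and its argument layout are hypothetical, and the final "arbitrary" tie-break is realized here simply as whichever tied candidate Python's `min` encounters first.

```python
import random

def pick_least_lower_bound(candidates, computed, dm):
    """Choose the next candidate O'.

    candidates -- iterable of indices of objects still in I
    computed   -- dict {index of a previously picked O'': d(T, O'')}
    dm         -- complete distance map, dm[i][j] = d(O_i, O_j)
    """
    if not computed:
        # First stage: no bounds are available yet, so pick at random.
        return random.choice(sorted(candidates))

    def bounds(o):
        # Tightest lower and upper bound on d(T, o) over all previous picks.
        lower = max(abs(d - dm[p][o]) for p, d in computed.items())
        upper = min(d + dm[p][o] for p, d in computed.items())
        return (lower, upper)   # least lower bound first, then least upper bound

    return min(candidates, key=bounds)

# Hypothetical data: objects on a line at 0, 4, 10, 11; d(T, O_0) = 5 is known.
points = [0, 4, 10, 11]
dm = [[abs(p - q) for q in points] for p in points]
choice = pick_least_lower_bound([1, 2, 3], {0: 5}, dm)
```

Here the lower bounds for objects 1, 2 and 3 are 1, 5 and 6, so object 1 is chosen.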

In  Section  4,  we  will  carry  out  experiments  to  compare  this  heuristic  with  the  heuristics  proposed  by  other 
researchers.  We  will  see  that  our  heuristic  is  an  improvement  on  earlier  work  and  requires  only  20%  more 
searches  than  the  fewest  possible,  according  to  simulation. 

3      Partial  Distance  Map 

The search procedure discussed in the previous section assumes a complete distance map. If only a partial map
is given, i.e., there exist $O_i$, $O_j$ whose distance is absent, it will be beneficial to first estimate those unknown
distances before processing queries. In this section, we show how to obtain such an approximate distance map
(ADM) and how it can be used for searching.

3.1      Approximate  Distance  Map 

Each entry $[i, j]$ in the ADM either shows the exact distance between $O_i$ and $O_j$, or (if it is not computed)
gives a lower bound for $d(O_i, O_j)$. This bound may be rather crude (i.e. too low) initially, and it will be
gradually refined after new distances between $T$ and objects in $\mathcal{D}$ are computed.

To compute the ADM, we start by constructing a weighted undirected graph on $\mathcal{D}$, such that there is an
edge between $O_i$ and $O_j$ iff $d(O_i, O_j)$ has been computed. If there is such an edge $e$, its weight, denoted $w(e)$,
is the computed distance. We define a path from $O_i = O_{i_1}$ to $O_j = O_{i_m}$ as a sequence of distinct objects
$O_{i_1}, O_{i_2}, \ldots, O_{i_m}$ such that $\{O_{i_1}, O_{i_2}\}, \{O_{i_2}, O_{i_3}\}, \ldots, \{O_{i_{m-1}}, O_{i_m}\}$ are edges in the graph, and the weight of
the path is the sum of weights of its constituent edges.

Lemma 3. (Generalized Triangle Inequality) Suppose there is a path $P$ from $O_i$ to $O_j$. Let $e$ be the edge
of maximum weight in $P$. Then

$$d(O_i, O_j) \ge w(e) - \sum_{e' \in P - \{e\}} w(e').$$

Proof. By induction on the number of objects in $P$ and repeated application of the triangle inequality.        □
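
As a small illustration of Lemma 3, the bound for a single path depends only on its edge weights. The sketch below is not from the paper; the function name `path_lower_bound` and the sample weights are hypothetical.

```python
def path_lower_bound(weights):
    """Generalized triangle inequality for one path.

    weights -- edge weights along a path from O_i to O_j
    Returns w(e) minus the sum of all other weights, where e is the heaviest edge.
    A result of 0 or less carries no information about d(O_i, O_j).
    """
    heaviest = max(weights)
    return heaviest - (sum(weights) - heaviest)

# A path with edges of weight 10, 2 and 3 implies d(O_i, O_j) >= 10 - (2 + 3).
example_bound = path_lower_bound([10, 2, 3])
```

The same path also yields the upper bound 10 + 2 + 3 = 15, which is the quantity tracked by the matrix MIN introduced below.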


The above lemma states that one can obtain a lower bound for $d(O_i, O_j)$ by applying the triangle inequality
to a path from $O_i$ to $O_j$. Of course, such a bound is useless if the term on the right-hand side of the inequality
is less than or equal to 0. Generally, we want this bound to be as high as possible. Let $\mathcal{P}$ be the set containing
all paths from $O_i$ to $O_j$. We define $ADM[i, j]$ to be the maximum bound obtained from all paths in $\mathcal{P}$.
Evidently, $ADM[i, j] = d(O_i, O_j)$ if edge $\{O_i, O_j\} \in \mathcal{P}$.

It is impractical, in general, to enumerate all of the paths in $\mathcal{P}$ to get $ADM[i, j]$, for there may be an
exponential number of them. Instead, we use a dynamic programming technique similar to the transitive
closure algorithm [25] to compute the ADM. To facilitate the computation, we also maintain an additional
matrix MIN, where $MIN[i, j]$ is the minimum weight of any path from $O_i$ to $O_j$. Thus, $MIN[i, j]$ basically
gives the least upper bound of the distance between $O_i$ and $O_j$. Clearly, $MIN[i, j] = d(O_i, O_j)$ when the
value is computed.

Following [3], we define $ADM_k[i, j]$, $0 \le k \le n$, to be the greatest lower bound offered by the paths from
$O_i$ to $O_j$ that do not pass through an object numbered higher than $k$, and $MIN_k[i, j]$ as the minimum weight
of all such paths.

Lemma 4. Let $S_k(i, j)$, $1 \le k \le n$, denote the set of paths going from $O_i$ to $O_k$ and then from $O_k$ to
$O_j$, without passing through an object numbered higher than $k$. Suppose $S_k(i, j) \ne \emptyset$. Let $B_k(i, j)$ be the
greatest lower bound obtained by applying the generalized triangle inequality to all the paths in $S_k(i, j)$. Then

$$B_k(i, j) = \max\{ADM_{k-1}[i, k] - MIN_{k-1}[k, j],\ ADM_{k-1}[j, k] - MIN_{k-1}[k, i]\} \quad \text{for } 1 \le k \le n.$$

Proof. Let $P \in S_k(i, j)$ be a path yielding $B_k(i, j)$. Let $P_1$ be the segment of $P$ between $O_i$ and $O_k$, and
$P_2$ be the segment of $P$ between $O_k$ and $O_j$. Suppose first that the edge $e$ of maximum weight is in $P_1$. By
Lemma 3, we get

$$B_k(i, j) = w(e) - \sum_{e' \in P_1 - \{e\}} w(e') - \sum_{e' \in P_2} w(e').$$

Claim that

$$ADM_{k-1}[i, k] = w(e) - \sum_{e' \in P_1 - \{e\}} w(e').$$

By induction,

$$ADM_{k-1}[i, k] \ge w(e) - \sum_{e' \in P_1 - \{e\}} w(e'),$$

and if strict inequality held, we could construct a path $P'$ in $S_k(i, j)$ by concatenating a path $P_1'$, which yields
$ADM_{k-1}[i, k]$, and $P_2$. The bound achieved by $P'$ would be greater than $B_k(i, j)$, contradicting the definition
of $B_k(i, j)$. By an analogous argument,

$$MIN_{k-1}[k, j] = \sum_{e' \in P_2} w(e').$$

Thus, $B_k(i, j) = ADM_{k-1}[i, k] - MIN_{k-1}[k, j]$.

If $e$ is in $P_2$, symmetric arguments yield $B_k(i, j) = ADM_{k-1}[j, k] - MIN_{k-1}[k, i]$.        □

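From the above lemma, we have, for each $k$,

$$ADM_k[i, j] = \max\{ADM_{k-1}[i, j],\ B_k(i, j)\}.$$

Moreover [3],

$$MIN_k[i, j] = \min\{MIN_{k-1}[i, j],\ MIN_{k-1}[i, k] + MIN_{k-1}[k, j]\}.$$

These formulae give rise to a Floyd-Warshall style algorithm for computing the approximate distance map.
The procedure is given in Figure 4.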

for i := 1 to n do
    for j := 1 to i - 1 do
        if DM[i,j] is present then begin
            ADM[i,j] := DM[i,j]; MIN[i,j] := DM[i,j]
        end
        else begin
            ADM[i,j] := 0; MIN[i,j] := ∞
        end;
for k := 1 to n do
    for i := 2 to n do
        for j := 1 to i - 1 do begin
            ADM[i,j] := max {ADM[i,j], ADM[i,k] - MIN[k,j], ADM[j,k] - MIN[k,i]};
            MIN[i,j] := min {MIN[i,j], MIN[i,k] + MIN[k,j]}
        end;

Figure  4.  Algorithm  APPROXIMATE 
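
A runnable sketch of Figure 4 is given below. It departs from the figure in two labeled ways, both assumptions made for simplicity: it stores full symmetric matrices instead of only the lower triangle, and it initializes the diagonal explicitly (ADM and MIN are 0 for i = j). The function name `approximate` and the `known` dictionary format are hypothetical.

```python
import math

def approximate(n, known):
    """Sketch of algorithm APPROXIMATE.

    n     -- number of objects (0-based indices here, unlike the paper)
    known -- dict {(i, j): distance} for the pairs whose distance is present
    Returns (adm, mn): lower-bound and upper-bound matrices, both symmetric.
    """
    adm = [[0] * n for _ in range(n)]
    mn = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), d in known.items():
        adm[i][j] = adm[j][i] = d
        mn[i][j] = mn[j][i] = d

    for k in range(n):
        for i in range(n):
            for j in range(i):
                # Best lower bound over paths through k (Lemma 4) vs. current bound.
                lb = max(adm[i][j],
                         adm[i][k] - mn[k][j],
                         adm[j][k] - mn[k][i])
                # Shortest path weight through k vs. current upper bound.
                ub = min(mn[i][j], mn[i][k] + mn[k][j])
                adm[i][j] = adm[j][i] = lb
                mn[i][j] = mn[j][i] = ub
    return adm, mn

# Hypothetical partial map: d(O_0, O_1) = 10 and d(O_1, O_2) = 3 are known.
adm, mn = approximate(3, {(0, 1): 10, (1, 2): 3})
```

For the unknown pair (O_0, O_2) this gives the bounds 10 - 3 = 7 and 10 + 3 = 13.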

Notice that due to the symmetry, we only compute the lower triangular part of the matrices. Also, in the
algorithm,

$$ADM[i, k] = \begin{cases} ADM[i, k] & \text{if } i > k \\ ADM[k, i] & \text{otherwise} \end{cases}$$

This holds for MIN as well.

Using  induction  on  k,  we  obtain 

Theorem 1. Algorithm APPROXIMATE correctly computes matrices ADM and MIN; that is, it achieves
the optimal distance approximation in the sense that the lower (resp. upper) bound of any path going from $O_i$
to $O_j$ is less (resp. greater) than or equal to $ADM[i, j]$ (resp. $MIN[i, j]$), given the distances that have been
computed.       □

3.2      Searching  Using  An  ADM 

Recall that, in algorithm SEARCH, set $I$ is used to hold candidates (i.e. objects that have been neither computed
nor discarded) needed for the next stages. An obvious drawback of applying this algorithm directly to a
partial distance map is that the $I$ obtained at stage $i$ needs to include all objects $O$ for which $d(X_i, O)$
doesn't exist. This may make $I$ unnecessarily large. With the aid of an ADM (and MIN), one can reduce
the size of $I$ by excluding the objects $O$ where the lower bound of the distance between $X_i$ and $O$ is already
greater than $d_i + m$ or the upper bound is less than $d_i - m$ when $d_i \ge m$, or the lower bound is greater than
$2d_i$ when $d_i < m$ (recall that $m$ is the current minimum distance value). To put it another way, if we augment
the ADM with an additional row, say row $n + 1$, for object $T$ (i.e. treating $T$ as $O_{n+1}$) so that $ADM[n + 1, i]$
always gives the current greatest lower bound for $d(T, O_i)$,² then $I$ will only contain the objects $O_i$ whose
$ADM[n + 1, i]$ is still less than or equal to $m$. Figure 5 gives the algorithm.

I := D; m := ∞;
initialize ADM and MIN as done in Figure 4;
augment ADM and MIN with an additional row for object T;
repeat
    choose an object O' in I using pick least lower bound;
    calculate d(T, O');
    update the augmented ADM and MIN;
    m := min {m, d(T, O')};
    I := {O_i | (d(T, O_i) is not computed) ∧ (ADM[n + 1, i] ≤ m)}
until I = ∅;

Figure  5.  Algorithm  PARTIAL  SEARCH 

Note:  Starting  with  an  optimal  approximate  distance  map  (Theorem  1),  the  algorithm  developed  here  is  best 
possible  for  the  best-match  problem  in  the  sense  that  given  an  object  at  stage  i,  it  throws  out  all  the  objects 
that  can  be  inferred  to  be  irrelevant  to  the  solution  at  that  stage.  What  may  influence  the  performance  of  the 
algorithm  is  the  heuristic  utilized  in  selecting  objects  at  each  stage  -  the  better  the  heuristic  (or  the  better 
our  luck),  the  better  performance  the  algorithm  achieves. 

3.3     Updating  Augmented  ADM  and  MIN 

Each computation of the distance between $T$ and some object $O_k$ may lead to modifications of the augmented
ADM and MIN. Observe that the value of $d(T, O_k)$ affects only the paths going through $\{T, O_k\}$. Let
$L$ ($U$) be the new lower (upper) bound of the paths from $O_i$ to $O_j$ via $\{T, O_k\}$; similarly to Lemma 4, we obtain

$$L = \max \begin{cases}
d(T, O_k) - MIN[i, n+1] - MIN[k, j] \\
d(T, O_k) - MIN[i, k] - MIN[j, n+1] \\
ADM[i, n+1] - d(T, O_k) - MIN[k, j] \\
ADM[i, k] - d(T, O_k) - MIN[n+1, j] \\
ADM[j, k] - d(T, O_k) - MIN[i, n+1] \\
ADM[n+1, j] - d(T, O_k) - MIN[i, k]
\end{cases}$$

$$U = \min \begin{cases}
MIN[i, n+1] + d(T, O_k) + MIN[k, j] \\
MIN[i, k] + d(T, O_k) + MIN[n+1, j]
\end{cases}$$

Thus, after computing $d(T, O_k)$, to find the new (tighter) bounds for the distances between objects $O_i$,
$O_j \in \{T\} \cup \mathcal{D}$, it suffices to compare $ADM[i, j]$ ($MIN[i, j]$) with $L$ ($U$) (recall that $ADM[n + 1, i]$ always
gives the current greatest lower bound for $d(T, O_i)$).
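
The constant-time computation of L and U for one pair can be sketched as below. This is a direct transcription of the two formulas under stated assumptions: the function name `via_bounds` is hypothetical, indices are 0-based with `t` denoting the row of the target T (the paper's n + 1), and the first term of U is read as MIN[i, n+1] + d(T, O_k) + MIN[k, j].

```python
import math

def via_bounds(adm, mn, i, j, k, t, d):
    """Bounds on d(O_i, O_j) from paths using the new edge {T, O_k}.

    adm, mn -- augmented lower-bound / upper-bound matrices (symmetric)
    t       -- index of the row and column standing for the target T
    d       -- the freshly computed distance d(T, O_k)
    """
    lower = max(
        d - mn[i][t] - mn[k][j],
        d - mn[i][k] - mn[j][t],
        adm[i][t] - d - mn[k][j],
        adm[i][k] - d - mn[t][j],
        adm[j][k] - d - mn[i][t],
        adm[t][j] - d - mn[i][k],
    )
    upper = min(
        mn[i][t] + d + mn[k][j],
        mn[i][k] + d + mn[t][j],
    )
    return lower, upper

# Hypothetical state: two objects with d(O_0, O_1) = 4, target T at index 2,
# nothing known about T yet; then d(T, O_0) = 10 is computed.
INF = math.inf
adm = [[0, 4, 0], [4, 0, 0], [0, 0, 0]]
mn = [[0, 4, INF], [4, 0, INF], [INF, INF, 0]]
new_bounds = via_bounds(adm, mn, i=2, j=1, k=0, t=2, d=10)
```

The resulting pair (6, 14) brackets d(T, O_1), matching 10 - 4 and 10 + 4; the caller would then take the maximum with ADM[i, j] and the minimum with MIN[i, j].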

Note that we update only the pairs whose distances are still unknown. For those pairs of objects whose
distances have been calculated, the distance values already represent both the best lower bounds and upper
bounds, and hence they need not be modified. Calculating $L$ and $U$ takes only constant time. Thus the
overhead incurred by updating a map is negligible when the map is almost complete.

² We shall discuss how to update such an augmented ADM in Section 3.3. For now let us assume that this map can somehow
be maintained.

If, however, there exists a large portion of object pairs whose distances are absent, the recomputation
would be quite expensive. In such a situation, we could incrementally update the maps using the techniques
described in [2, 5], or could only update the bounds for pairs $(T, O_i)$, $O_i \in \mathcal{D}$, while keeping the initial bounds
for $(O_i, O_j)$, $O_i, O_j \in \mathcal{D}$. (This strategy is similar to the one suggested in [1] for maintaining shortest paths
in a sizable graph.) This will, of course, make set $I$ somewhat larger because we now have weaker bounds for
$(T, O_i)$ than we would if we updated all the pairs with unknown distances at each stage. Whether the gain
obtained by globally updating is substantial enough to offset the incurred overhead can only be determined
by experimental work. In the next section, we will see that the gain is rarely worth it.
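
The cheaper strategy, updating only the bounds for pairs (T, O_i), can be sketched as follows. This is a simplified illustration rather than the paper's exact procedure: it uses only the direct edge {T, O_k} for each update, picks the candidate with the least lower bound but breaks ties by index instead of by upper bound, and starts from the lowest index rather than a random object. The function name `partial_search_linear` and the sample matrices are hypothetical.

```python
import math

def partial_search_linear(adm, mn, dist_to_target):
    """PARTIAL SEARCH with only the target's row of bounds being updated.

    adm, mn        -- n x n symmetric lower/upper bound matrices for the database
    dist_to_target -- callable i -> d(T, O_i); one call = one comparison
    Returns (minimum distance found, number of comparisons made).
    """
    n = len(adm)
    lower = [0] * n            # plays the role of ADM[n+1, i]
    computed = set()
    m = math.inf
    candidates = set(range(n))
    while candidates:
        # Pick least lower bound; ties broken by index (a simplification).
        x = min(candidates, key=lambda o: (lower[o], o))
        d = dist_to_target(x)
        computed.add(x)
        m = min(m, d)
        for o in range(n):
            if o not in computed:
                # d(T,o) >= d(x,o) - d(T,x) >= ADM[x,o] - d, and
                # d(T,o) >= d(T,x) - d(x,o) >= d - MIN[x,o].
                lower[o] = max(lower[o], adm[x][o] - d, d - mn[x][o])
        candidates = {o for o in range(n)
                      if o not in computed and lower[o] <= m}
    return m, len(computed)

# Hypothetical bounds for objects at 0, 4, 10 on a line, where only
# d(O_0, O_1) = 4 and d(O_1, O_2) = 6 are known (so 2 <= d(O_0, O_2) <= 10).
adm = [[0, 4, 2], [4, 0, 6], [2, 6, 0]]
mn = [[0, 4, 10], [4, 0, 6], [10, 6, 0]]
points = [0, 4, 10]
near = partial_search_linear(adm, mn, lambda i: abs(points[i] - 0.5))
far = partial_search_linear(adm, mn, lambda i: abs(points[i] - 5))
```

With the target at 0.5 a single comparison suffices, while with the target at 5 the weak bound on the unknown pair forces all three comparisons.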

4     Performance  Analysis 

We  have  performed  a  series  of  experiments  to  evaluate  the  effectiveness  of  our  search  scheme  and  its  relative 
performance  to  those  proposed  previously.  Table  I  shows  the  basic  parameters  used  in  the  experiments. 


Parameter      Meaning

Size           Number of objects in the database
Density        Portion of known distances in the map
MinDistance    Minimum distance between objects
MaxDistance    Maximum distance between objects

Table I. Experimental Parameters

[MinDistance, MaxDistance] specifies the range over which distances between objects are distributed.
The Density parameter represents the portion of known distances in a map, and is computed by dividing
the number of object pairs with known distances by the total number of object pairs in the corresponding
database. To compare different algorithms for the best-match query, we used the following metric:

$$\mathrm{PERFO} = \frac{\mathit{NumComputed}}{\mathit{Size}} \times 100\%$$

where NumComputed is the number of objects whose distances from the target have been computed. PERFO
stands for PERcentage of brute FOrce cost. One would like this percentage to be as low as possible.
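
As a quick arithmetic illustration of the metric, the helper below (a hypothetical name, not from the paper) evaluates PERFO for made-up counts.

```python
def perfo(num_computed, size):
    """PERFO: percentage of the brute-force cost actually paid."""
    return 100.0 * num_computed / size

# Hypothetical run: 36 target distances computed in a database of 150 objects.
example = perfo(36, 150)
```

A brute-force scan always has PERFO = 100%, since it computes the distance to every object.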

The sample maps used in each run in the experiments were generated as follows. We first obtained a
(Size + 1) × (Size + 1) consistent map, with each entry being a random number over some positive interval.
By "consistent", we mean that the map satisfies the conditions of a distance metric. All entries except those
on the outermost row (and column) constituted a complete map, and values on the outermost row represented
distances between the target and objects in the database. The reason for generating all these distances in the
very beginning rather than generating them when needed was to ensure that the triangle inequality held not
only among the objects in the database, but also among them and the target. In producing a partial map,
we randomly selected Density × Size × (Size - 1) entries out of the complete map as known distances and
kept aside the distances stored in the other positions. The latter values were added to the map only when the
algorithm decided to perform their computation during searching.
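
The paper does not say how consistency was enforced. One possible way, shown here purely as an assumption, is to draw random symmetric weights and then replace every entry by the corresponding shortest-path length, which guarantees the triangle inequality (at the cost of no longer being exactly uniform). The function names are hypothetical, integer weights are used so the check is exact, and the partial map is sampled over unordered pairs.

```python
import random

def random_consistent_map(size, lo, hi, rng):
    """(size + 1) x (size + 1) metric; the last row/column stands for the target T."""
    n = size + 1
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d[i][j] = d[j][i] = rng.randint(lo, hi)   # lo must be positive
    # Metric closure (Floyd-Warshall): shortest-path lengths obey the triangle inequality.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def partial_map(d, size, density, rng):
    """Keep a random fraction `density` of the intra-database pairs as known."""
    pairs = [(i, j) for i in range(size) for j in range(i)]
    chosen = rng.sample(pairs, round(density * len(pairs)))
    return {(i, j): d[i][j] for i, j in chosen}

rng = random.Random(0)
full = random_consistent_map(6, 1, 100, rng)
known = partial_map(full, 6, 0.5, rng)
```

The distances left out of `known` stay available in `full`, so they can be revealed only when a search algorithm asks for them.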

To keep the analysis tractable, most experiments were performed with consistent maps of size 150. Nevertheless,
our empirical study indicated that the larger the map, the more effective our scheme became.
4.1      Effect of Density

In the first set of experiments, we examined the effect of varying densities on PERFO. The distances between
objects were drawn randomly from a uniform distribution with the range [0, 10,000].³ In each run, we
created a complete map first, and executed algorithm SEARCH on that map. We then gradually decreased
the density, running algorithm PARTIAL SEARCH on the resulting maps. In order to see the effect of not
globally updating maps after each distance computation, we also ran two variations on each map: one
updates the bounds for object pairs $(T, O)$, $O \in \mathcal{D}$, and the other updates only the bounds for $(T, O)$, where
$O$ is still a candidate. The results are reported in Table II, with each figure representing the average of 30 runs.


              Updating policy
Density    Global    Linear    Candidate only

0.01       81.9%     82.1%     82.4%
0.05       54.4%     54.8%     54.9%
0.1        44.8%     44.9%     44.9%
0.5        26.8%     27.9%     28.5%
0.7        25.1%     25.9%     26.2%
0.9        24.3%     24.6%     25.1%
1.0        23.5% (single value for the complete map)

Global: update $(O, O')$, $O, O' \in \{T\} \cup \mathcal{D}$. Linear: update
$(T, O)$, $O \in \mathcal{D}$. Candidate only: update $(T, O)$, where $O$ is a
candidate.

Table II. Effect of Updating Policies on PERFO for Various Map Densities

It is clear that PERFO drops (i.e. improves) as the density of a map increases. The rate of decrease
slows once the density exceeds 0.5. To our surprise, all three updating policies behave very
similarly. This probably happens for two reasons. First, the edges between $T$ and objects in $\mathcal{D}$ occupy
a small portion of the entire set of edges and hence their computation could not significantly improve bounds
for absent intra-database distances. Second, better bounds between $T$ and candidate objects usually come
from paths going through recently computed edges, not from those through discarded objects. This result
suggests that when having a large database, one should update only the bounds on the distances between the
target and candidates after each comparison, disregarding objects that have been eliminated. While the gain
is slightly less, the total computation becomes much cheaper. (This may be important if one occasionally
needs answers quickly to the best-match problem and then fills in the distance map at leisure.)

We then compared the number of comparisons our heuristic required with the minimum possible. Empirically,
our heuristic was within 20% of the optimum for a database of size 20. This result is encouraging, given
the fact that we have no way of knowing a priori what the best distances to calculate are.

4.2     Non-Uniformly  Distributed  Distances 

To see the effect of non-uniform distribution for distances between objects, we have performed two extreme
experiments on maps with densities 1, 0.5 and 0.01, respectively (they represent complete, half complete and very
sparse maps).


³ We ran the experiments on several other ranges and obtained similar results. This agrees with our intuition that the distance
range should not have any effect on the performance of the algorithms.
In the first experiment, we considered databases in which objects are far from one another, but one is
very close to the target $T$. We generated thirty distance maps as described earlier except that the distance
between an (arbitrarily selected) object and $T$ was drawn randomly from the range [0, 100], and all the other
distances from the range [1,000, 10,000]. Our results showed that the values of PERFO in this case are
very low (1.3% for Density = 1, 3.4% for Density = 0.5, and 60.4% for Density = 0.01, respectively). In
particular, for complete maps, at most two comparisons were needed to get the closest object. This is not
surprising, given the observation that the pick least lower bound heuristic can always make the right choice
after its first try.

In the second experiment, we considered databases in which objects are close to one another, but far from
T. The distances between objects in databases were drawn randomly from the range [0, 1,000]. PERFO is
plotted in Figure 6 as a function of the range from which distances between objects and T were drawn, where
each point on the X axis gives the maximum value of the range (the minimum value was fixed at 10,000).*

From the figure we see that PERFO drops significantly as the range size increases. This is because
the larger the range, the more objects there are that are closer to $X_i$ than $d_i - m$ (or closer to $X_j$ than $d_j - d_i$) (cf.
Figure 2). Consequently, more objects can be discarded at each stage. (The procedure for discarding objects
that are farther away from $X_i$ than $d_i + m$ (or $2d_i$) is no longer effective here.) Notice also that, under this
circumstance, our scheme suffers when objects in a database are extremely close to one another.
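
The discarding rule invoked above can be traced back to the triangle inequality. As a brief sketch, write $d_i = d(T, X_i)$ for the distance just computed and $m$ for the smallest target distance found so far (this notation is assumed from the surrounding discussion). For any object $O$:

\[
d(X_i, O) < d_i - m
\;\Longrightarrow\;
d(T, O) \;\ge\; d(T, X_i) - d(X_i, O) \;>\; d_i - (d_i - m) \;=\; m ,
\]

so $O$ cannot be closer to $T$ than the current best and may be discarded without computing $d(T, O)$. In the second experiment the intra-database distances are at most 1,000, so even a moderate gap $d_i - m$ can already exceed many of the distances $d(X_i, O)$.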

4.3     Results  for  Protein  Data 

In real applications, distances among objects may be distributed neither as uniformly as in Section 4.1 nor as
extremely as in Section 4.2. To see how well our scheme applies to actual database systems, we ran our
algorithms on a set of proteins. We randomly selected 150 proteins from the sequence database of Thinking
Machines. Each protein has between 4 and 20 amino acids. (An amino acid is represented by a numerical or
alphabetical character.) The inter-protein distances were computed based on the Dayhoff score metric [15].^

We ran our algorithms on these proteins thirty times, each time using a randomly selected (distinct) protein
as the target. Table III shows the means and standard deviations of the number of proteins whose distance
to the target was actually computed, for various densities.


Density      1       0.9     0.7     0.5     0.1     0.05    0.01
Mean         95      98      106     111     138     143     147
Std. dev.    66      62      53      46      14      7       3
PERFO        63.8%   65.8%   71.1%   74.5%   92.6%   96.0%   98.7%

Table III. Statistics for Protein Data
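
As a quick consistency check on Table III, the PERFO row can be recomputed from the mean row. The sketch below assumes that PERFO is the mean number of computed distances divided by the database size, and that the database holds 149 proteins (the 150 selected minus the one used as target); both are assumptions for this example, since PERFO is defined earlier in the paper.

```python
# Assumption: 150 selected proteins minus the one used as the target.
DATABASE_SIZE = 149

# Mean number of proteins computed, per density, as listed in Table III.
MEAN_COMPUTED = {1: 95, 0.9: 98, 0.7: 106, 0.5: 111, 0.1: 138, 0.05: 143, 0.01: 147}


def perfo(mean_computed, size=DATABASE_SIZE):
    """Percentage of the database compared against the target (assumed definition)."""
    return round(100.0 * mean_computed / size, 1)


for density, mean in MEAN_COMPUTED.items():
    print(f"density {density}: PERFO = {perfo(mean)}%")
```

Under these assumptions the recomputed percentages agree with the PERFO row to one decimal place.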

Comparing with Table II, we see that the values of PERFO obtained from the proteins are higher (i.e.,
worse) than those from generated data. A close look at the data reveals why this happens. This sample
database contains many small clusters whose member proteins are close to one another. All the other proteins are
(roughly) equally distant from each other (and from those in the clusters). Thus, when targets are members of
clusters, as analyzed in the previous section, only a few tries are needed to get the closest proteins, whereas
a non-member target can result in many proteins being computed.

*Due to the triangle inequality, the difference between the maximum and minimum values in this case cannot be greater than
1,000.

^The Dayhoff score metric differs from, although it is isomorphic to, a distance metric in the sense that the higher the score between
two proteins, the closer they are. We used the following formula to compute the distance between two proteins based on their
scores: $d(p_1, p_2) = c - s(p_1, p_2)$, where $c$ is an empirical constant ensuring that the difference satisfies the conditions of distance
metrics, and $s(p_1, p_2)$ is the score between proteins $p_1$ and $p_2$.
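
The score-to-distance conversion described in the footnote, d(p1, p2) = c - s(p1, p2), can be sketched as below. The function names and the nested-dictionary layout are hypothetical, the scores in the usage note are made up for illustration, and setting the self-distance to zero is an assumption not stated in the text.

```python
from itertools import permutations


def scores_to_distances(scores, c):
    """Turn similarity scores into distances via d = c - s (self-distance set to 0)."""
    names = list(scores)
    return {p: {q: 0 if p == q else c - scores[p][q] for q in names} for p in names}


def satisfies_metric_axioms(dist):
    """Check positivity, symmetry and the triangle inequality on a finite distance table."""
    names = list(dist)
    for p in names:
        for q in names:
            if p != q and (dist[p][q] <= 0 or dist[p][q] != dist[q][p]):
                return False
    return all(dist[a][b] <= dist[a][k] + dist[k][b]
               for a, k, b in permutations(names, 3))
```

For example, with scores s(A, B) = 10, s(A, C) = 2 and s(B, C) = 3, the constant c = 11 yields a valid metric while c = 10.5 violates the triangle inequality, which illustrates why c has to be chosen large enough for the data at hand.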

4.4      Comparison  to  Previous  Methods 

In this subsection, we describe several experiments comparing the performance of our scheme
with those proposed by Burkhard & Keller [4] and Shapiro [22], under the condition that the same distance
information is given. The experiments were carried out using both uniformly distributed data and actual
proteins.

Before comparing these approaches, let us first distinguish their precomputed information, search
heuristics and cut-off criteria.

Burkhard & Keller:

    Precomputed distances:  $d(O, O^0)$, $\forall\, O \in V$, for a randomly chosen object $O^0$
    Heuristic:              first compute $d(T, O^0)$; then pick $X_k$, $k = 1, 2, 3, \ldots$,
                            where $|d(X_k, O^0) - d(T, O^0)| \le |d(X_{k+1}, O^0) - d(T, O^0)|$
    Cut-off criterion:      eliminate $O$ if $|d(O, O^0) - d(T, O^0)| > m$

Shapiro:

    Precomputed distances:  $d(O, O^i)$, $\forall\, O \in V$, $i = 1, 2, \ldots, s$, for $s$ randomly chosen objects $O^i$
    Heuristic:              first compute $d(T, O^i)$, $i = 1, 2, \ldots, s$; then pick $X_k$, $k = 1, 2, 3, \ldots$,
                            where $|d(X_k, O^1) - d(T, O^1)| \le |d(X_{k+1}, O^1) - d(T, O^1)|$
    Cut-off criterion:      eliminate $O$ if $\exists\, i \in \{1, \ldots, s\}$ such that $|d(O, O^i) - d(T, O^i)| > m$

Our scheme:

    Precomputed distances:  arbitrary
    Heuristic:              pick least lower bound
    Cut-off criterion:      eliminate $O$ if $ADM[n + 1, O] > m$

The objects $O^i$, $i = 0, 1, \ldots, s$, are called reference points in [4, 22]. Shapiro's method can be seen
as an extension of Burkhard's to more than one reference point. The weighted graph constructed on V for
the latter is a star, whereas there are s stars for the former, each centered at a different reference point. (In
the following, Burkhard's and Shapiro's methods are collectively referred to as BKS, and our scheme is called
SW.)
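
The reference-point schemes summarized above can be sketched as follows. This is an illustrative reading of the BKS description, not code from [4] or [22]: the function name, argument layout and the choice to order candidates by the first reference point are assumptions made for the example.

```python
def bks_search(objects, refs, ref_dist, dist_to_target):
    """Best match using reference-point ordering and cut-off (a sketch of the BKS idea).

    objects: list of object ids; refs: list of reference ids drawn from objects.
    ref_dist[r][o]: precomputed distance between reference r and object o.
    dist_to_target(o): computes d(T, o); this is the expensive comparison we count.
    Returns (best_object, best_distance, number_of_comparisons).
    """
    t_ref = {r: dist_to_target(r) for r in refs}  # compare T with every reference point
    computed = len(refs)
    best = min(refs, key=t_ref.get)
    m = t_ref[best]
    first = refs[0]
    # Visit the remaining objects in increasing order of |d(X, O^1) - d(T, O^1)|.
    candidates = sorted((o for o in objects if o not in t_ref),
                        key=lambda o: abs(ref_dist[first][o] - t_ref[first]))
    for o in candidates:
        # Cut-off: some reference point proves d(T, o) > m, so o is skipped.
        if any(abs(ref_dist[r][o] - t_ref[r]) > m for r in refs):
            continue
        d = dist_to_target(o)
        computed += 1
        if d < m:
            best, m = o, d
    return best, m, computed
```

With a single entry in `refs` this corresponds to the one-star case; with several entries every reference point contributes to the cut-off test.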

Figure 7 illustrates the behavior of these schemes for various numbers of reference points. Each point of
the graphs in Figure 7 represents the average of 30 runs. The reference points were chosen arbitrarily and,
as suggested in [22], were kept outside clusters when running on proteins. To fully exploit the precomputed
distances, pick least lower bound was modified to first calculate the distances between T and the reference
points, and then pick objects with the least lower bound in subsequent stages.
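
The modified search described above can be sketched as follows. It is a simplified illustration under stated assumptions: lower bounds are derived only from exactly known distances (not from a full approximate distance map), ties are broken by input order, and all names, including the `first` parameter for the reference points, are hypothetical.

```python
def sw_search(objects, known, dist_to_target, first=()):
    """Sketch of a 'pick least lower bound' search.

    known: dict mapping frozenset({a, b}) to a precomputed distance; absent pairs are missing.
    first: objects (e.g. reference points) compared with T before the heuristic takes over.
    Returns (best_object, best_distance, number_of_comparisons).
    """
    lower = {o: 0.0 for o in objects}  # lower bounds on d(T, o)
    alive = list(objects)              # candidates not yet compared or eliminated
    pending = list(first)
    best, m, computed = None, float("inf"), 0
    while alive:
        if pending:
            x = pending.pop(0)
            if x not in alive:
                continue
        else:
            x = min(alive, key=lower.get)  # least lower bound; ties go to input order
        d_tx = dist_to_target(x)
        computed += 1
        alive.remove(x)
        if d_tx < m:
            best, m = x, d_tx
        for o in alive:
            d_xo = known.get(frozenset((x, o)))
            if d_xo is not None:
                lower[o] = max(lower[o], abs(d_tx - d_xo))
        # Cut-off: drop every candidate whose lower bound already exceeds m.
        alive = [o for o in alive if lower[o] <= m]
    return best, m, computed
```

Passing the reference points through `first` reproduces the two-phase behavior described in the text; leaving it empty lets the heuristic choose from the start.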

Figure 8 shows the relative performance of the heuristic employed by each scheme. To isolate the effect
of the cut-off procedures, we used BKS's cut-off criterion for both schemes. Figure 9 shows the relative
performance of the cut-off procedures employed by each scheme, where BKS's heuristic was used for both.

From these figures we see that SW outperforms BKS. Its superiority comes primarily from its
better search heuristic. The difference in cut-off criteria between the two schemes has only a minor
influence on performance (Fig. 9).

4.5      Effect  of  Data  Structures

An interesting question arising naturally from Burkhard's work is: Given a fixed number of distances,
say F, that one is allowed to calculate beforehand, what is the best set of distances (or edges) to choose?
That is, what kinds of data structures for a file yield the best performance? We tested our algorithms
on several candidate structures using generated data (similar results were obtained for proteins and are
not shown here):

•  Stars:  As  described  in  [4,  22].

•  Bipartite graphs: Objects in the database are evenly distributed into two disjoint groups, with those in
group $i$ denoted $O^i_j$, $i = 1, 2$, $j = 1, \ldots, \mathrm{Size}/2$. Edges are allocated evenly and are constructed
according to the following program:

for  i  :=  0,1,...  do  begin 
for  j  :=  1  to  Size/2  do 
compute  d(Oj ,  Oj+i ); 
for  j  :=  1  to  Size/2  do 
compute  d(Oj,0]^,); 
end; 

•  Cliques: Arbitrarily choose k objects and compute the distances between every pair of them, subject to the constraint
that $\binom{k}{2} \le F$. Arbitrarily choose the remaining edges.

•  Random  graphs:  Randomly  choose  F  edges. 

We first processed these structures with algorithm APPROXIMATE and then searched based on the resulting
maps. The heuristic pick least lower bound was used, with a slight modification for cliques: objects in
a clique were always examined before those outside the clique. (This is because we wanted to fully exploit
the precomputed information.) Figure 10 illustrates the relative performance of these structures for various
numbers of edges.
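
To make the candidate structures concrete, the sketch below generates edge sets for stars, cliques and random graphs under a budget of F precomputed distances. It is only an illustration: the function names are hypothetical, taking the first k objects as the clique and the largest admissible k are assumptions, and the bipartite construction is omitted because it follows the program given in the list above.

```python
import random
from itertools import combinations


def all_pairs(objects):
    """Every unordered pair of objects, as frozensets."""
    return [frozenset(p) for p in combinations(objects, 2)]


def star_edges(objects, budget):
    """Successive stars: link every object to objects[0], then objects[1], and so on."""
    edges, seen = [], set()
    for center in objects:
        for other in objects:
            pair = frozenset((center, other))
            if other == center or pair in seen:
                continue
            seen.add(pair)
            edges.append(pair)
            if len(edges) == budget:
                return edges
    return edges


def clique_edges(objects, budget, rng):
    """Largest clique with C(k, 2) <= budget, then arbitrary (random) remaining edges."""
    k = 1
    while k < len(objects) and (k + 1) * k // 2 <= budget:
        k += 1
    clique = all_pairs(objects[:k])
    in_clique = set(clique)
    rest = [p for p in all_pairs(objects) if p not in in_clique]
    return clique + rng.sample(rest, budget - len(clique))


def random_edges(objects, budget, rng):
    """F edges drawn uniformly without replacement from all pairs."""
    return rng.sample(all_pairs(objects), budget)
```

Passing an explicit `random.Random(seed)` as `rng` keeps the generated structures reproducible across runs.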

It is evident that stars outperform all the other structures and are the data structure of choice. Cliques
behave poorly. This is mainly because there are no edges between objects outside a clique, and hence
the distance bounds between them and the target are very low; consequently, few objects can be eliminated at
each stage. It is also interesting to note that random graphs tend to be better than bipartite graphs as the
number of edges increases.
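
The weakness of cliques can be illustrated with a small bound computation. The sketch below is a stand-in for the idea behind APPROXIMATE, not a reproduction of it: upper bounds are shortest-path lengths computed by Floyd-Warshall, and each lower bound is taken as the best "one known edge minus the detours at both ends" estimate. The function name and the input layout are assumptions for this example.

```python
import math


def approximate_bounds(n, known):
    """Return (lower, upper) bound matrices for n objects.

    known: dict mapping (i, j), i < j, to an exactly known distance.
    """
    upper = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), d in known.items():
        upper[i][j] = upper[j][i] = d
    # Floyd-Warshall: the shortest path through known edges bounds the distance from above.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if upper[i][k] + upper[k][j] < upper[i][j]:
                    upper[i][j] = upper[i][k] + upper[k][j]
    lower = [[0.0] * n for _ in range(n)]
    # d(i, j) >= d(a, b) - d(i, a) - d(b, j) for every known edge (a, b).
    for (a, b), d in known.items():
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                bound = max(d - upper[i][a] - upper[b][j],
                            d - upper[i][b] - upper[a][j])
                if bound > lower[i][j]:
                    lower[i][j] = bound
    return lower, upper
```

An object with no known edge to the rest of the map keeps a lower bound of zero against everything else, so it can never be eliminated by the cut-off test; this is the situation of objects outside a clique.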

5      Conclusions 

From  the  experimental  results,  we  observe  that: 

•  The heuristic pick least lower bound outperforms the heuristics proposed in previous work, and its
improvement grows as more data is available. Moreover, the heuristic can make use of whatever distance
information is available, rather than depending on a particular topology of computed distances.
Compared with the optimal search strategy, it requires only 20% more distance computations than
the fewest possible for uniformly distributed distances.

•  The performance of all schemes depends not only on the density of a map, but is also strongly
influenced by how distances are distributed over a range. For instance, when distances between objects
(including the target) are uniformly distributed, our scheme, with the aid of pick least lower bound,
performs less than 30% of the necessary computation for the best-match problem even if only 50% of a
map is computed. On the other hand, when objects tend to cluster into numerous small groups which
are (roughly) equally distant, just like those in the protein database, the scheme becomes less effective.
In this case, one needs to perform, on average, about 63% of the possible computation for complete
distance maps and about 74% for half-complete ones.

•  The multiple-star topology of computed distances proposed by Burkhard et al. is a good one. We
conjecture that it is the best one, because every object is connected to every other by a set of length-two
paths. We think theorists interested in average-case complexity will find this topology question an
intriguing problem.